Dataset: Social media data

https://github.com/abulbasar/data/blob/master/snsdata.csv?raw=true

  • Use the 36 word-frequency features, "basketball", "football" … "drunk", "drugs" (these columns count how many times a user has used each word in her profile), and apply K-Means clustering to group the profiles into 5 clusters.
  • Find the number of users in each cluster and the mean distance of points within each cluster.
  • Which cluster is the densest in terms of average distance?
  • How many anomalies are there?
  • For each cluster, find the top 3 dominant features.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster, metrics, preprocessing

%matplotlib inline

In [2]:
df = pd.read_csv("https://github.com/abulbasar/data/blob/master/snsdata.csv?raw=true")
df.head()


Out[2]:
gradyear gender age friends basketball football soccer softball volleyball swimming ... blonde mall shopping clothes hollister abercrombie die death drunk drugs
0 2006 M 18.982 7 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 2006 F 18.801 0 0 1 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 2006 M 18.335 69 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 2006 F 18.875 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 2006 NaN 18.995 10 0 0 0 0 0 0 ... 0 0 2 0 0 0 0 0 1 1

5 rows × 40 columns

Proportion of male/female profiles in the dataset


In [3]:
df.gender.value_counts()/len(df)


Out[3]:
F    0.735133
M    0.174067
Name: gender, dtype: float64
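
Note that the proportions sum to about 0.91: the remaining ~9% of profiles have a missing gender. Counting NaN explicitly makes that visible:

df.gender.value_counts(dropna=False) / len(df)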

In [4]:
features = df.columns[4:]
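
A quick sanity check that this picks up exactly the 36 word-frequency columns, from "basketball" through "drugs":

len(features), features[0], features[-1]   # expect (36, 'basketball', 'drugs')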

In [5]:
X = df[features] * 1.0      # cast the word counts to float
a = X.values.flatten()      # all 30,000 x 36 counts as a single 1-D array

fig, axes = plt.subplots(2, 1, figsize=(8, 6))

axes[0].hist(a, bins=50, log=True)
axes[0].set_title("Histogram of frequencies")
axes[1].boxplot(a, vert=False)
axes[1].set_title("Boxplot of frequencies")
plt.tight_layout()



In [6]:
a.shape


Out[6]:
(1080000,)

How many records are there for which the count is greater than 20?


In [7]:
a[a>20].shape


Out[7]:
(52,)

Clip the count values at 50.


In [8]:
X_clipped = np.clip(X.values, a_min=0, a_max=50)
plt.hist(X_clipped.flatten(), log=True, bins = 50);


Before applying K-Means, normalize the values to the [0, 1] range.


In [9]:
scaler = preprocessing.MinMaxScaler()
X_std = scaler.fit_transform(X_clipped)
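
MinMaxScaler rescales each column via x' = (x - min) / (max - min), so every feature ends up in [0, 1]. A quick check that the transform behaved as expected:

X_std.min(), X_std.max()   # expect (0.0, 1.0)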

Set the number of clusters to k=5.


In [10]:
k = 5
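
The task fixes k = 5. As an optional sanity check (not part of the original run), an elbow plot of the K-Means inertia against k can support the choice; this sketch refits the model for k = 2 … 10, which may take a while on 30,000 rows:

inertias = []
for n in range(2, 11):
    km = cluster.KMeans(n_clusters=n, random_state=1).fit(X_std)
    inertias.append(km.inertia_)   # within-cluster sum of squares
plt.plot(range(2, 11), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia");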

In [11]:
kmeans = cluster.KMeans(n_clusters=k, random_state=1)
kmeans.fit(X_std)


Out[11]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=1, tol=0.0001, verbose=0)

Predict the cluster for each point using the K-Means model.


In [12]:
y_pred = kmeans.predict(X_std)
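
Since we predict on the same data the model was fit on, this matches the fitted assignments:

(y_pred == kmeans.labels_).all()   # expect True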

Centroids of the clusters on the original scale (using scaler.inverse_transform).


In [13]:
centroids = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=features)
centroids.T


Out[13]:
0 1 2 3 4
basketball 2.185926e-01 0.449442 0.436441 0.412173 0.282163
football 2.111973e-01 0.377216 0.549435 0.360555 0.273462
soccer 1.931401e-01 0.367695 0.264124 0.276194 0.249845
softball 1.303934e-01 0.288903 0.238701 0.251541 0.161591
volleyball 1.103398e-01 0.342416 0.211864 0.181818 0.122436
swimming 1.041695e-01 0.267892 0.189266 0.202234 0.162213
cheerleading 3.017104e-02 0.085686 2.888418 0.058937 0.046613
baseball 9.568531e-02 0.119829 0.185028 0.143297 0.106277
tennis 7.817250e-02 0.136901 0.076271 0.108629 0.089497
sports 1.191870e-01 0.175312 0.168079 0.223421 0.210690
cute 2.137834e-01 0.800394 0.576271 0.556626 0.424487
sex 1.215916e-01 0.266907 0.300847 0.447612 0.839030
sexy 1.156935e-01 0.186146 0.203390 0.208783 0.269111
hot 8.343542e-02 0.314183 0.251412 0.211479 0.170914
kissed 4.645887e-02 0.136573 0.189266 0.325886 0.420137
dance 3.249399e-01 0.832239 0.668079 0.661017 0.540087
band 2.689079e-01 0.338805 0.261299 0.433359 0.436917
marching 3.978948e-02 0.040053 0.024011 0.053929 0.038533
music 6.345901e-01 0.984570 0.748588 1.137519 1.026725
rock 1.959076e-01 0.345043 0.360169 0.389445 0.413300
god 4.086475e-01 0.561064 0.608757 0.610555 0.743319
church 1.971780e-01 0.468155 0.365819 0.379815 0.266004
jesus 1.033075e-01 0.149048 0.115819 0.135978 0.121815
bible 1.851096e-02 0.021668 0.026836 0.036210 0.032940
hair 2.422758e-01 0.754760 0.661017 1.065485 1.121193
dress 5.784674e-02 0.387722 0.146893 0.194530 0.164077
blonde 5.389955e-02 0.166776 0.289548 0.173344 0.211311
mall 1.259471e-01 1.013132 0.418079 0.397535 0.330019
shopping 1.407831e-01 1.778070 0.707627 0.439137 0.267247
clothes 7.291390e-14 0.174327 0.217514 1.355162 0.156619
hollister 3.039789e-02 0.253775 0.211864 0.151772 0.067744
abercrombie 2.146001e-02 0.192055 0.158192 0.100924 0.064015
die 1.455923e-01 0.223572 0.190678 0.287365 0.467371
death 9.305385e-02 0.155942 0.127119 0.167180 0.234307
drunk -3.529121e-14 0.048260 0.086158 0.050847 1.428838
drugs 3.026178e-02 0.073867 0.081921 0.124037 0.336234
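
Values such as 7.3e-14 and -3.5e-14 in cluster 0 are floating-point noise from the inverse transform and are effectively zero; rounding makes the table easier to read:

centroids.T.round(3)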

For the first cluster, find the top 10 most dominant features by magnitude.


In [14]:
centroids.iloc[0].sort_values(ascending=False)[:10]


Out[14]:
music         0.634590
god           0.408648
dance         0.324940
band          0.268908
hair          0.242276
basketball    0.218593
cute          0.213783
football      0.211197
church        0.197178
rock          0.195908
Name: 0, dtype: float64

For the first cluster, music, god, dance, hair, etc. are the dominant features. Let's look at the dominant features in the other clusters.


In [15]:
centroids.iloc[1].sort_values(ascending=False)[:10]


Out[15]:
shopping      1.778070
mall          1.013132
music         0.984570
dance         0.832239
cute          0.800394
hair          0.754760
god           0.561064
church        0.468155
basketball    0.449442
dress         0.387722
Name: 1, dtype: float64

In [16]:
centroids.iloc[2].sort_values(ascending=False)[:10]


Out[16]:
cheerleading    2.888418
music           0.748588
shopping        0.707627
dance           0.668079
hair            0.661017
god             0.608757
cute            0.576271
football        0.549435
basketball      0.436441
mall            0.418079
Name: 2, dtype: float64

In [17]:
centroids.iloc[3].sort_values(ascending=False)[:10]


Out[17]:
clothes       1.355162
music         1.137519
hair          1.065485
dance         0.661017
god           0.610555
cute          0.556626
sex           0.447612
shopping      0.439137
band          0.433359
basketball    0.412173
Name: 3, dtype: float64

In [18]:
centroids.iloc[4].sort_values(ascending=False)[:10]


Out[18]:
drunk     1.428838
hair      1.121193
music     1.026725
sex       0.839030
god       0.743319
dance     0.540087
die       0.467371
band      0.436917
cute      0.424487
kissed    0.420137
Name: 4, dtype: float64
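
The task only asks for the top 3 dominant features per cluster; a compact way to list them all at once:

for i in range(k):
    print(i, list(centroids.iloc[i].nlargest(3).index))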

As "music" and "god" appear in the top 10 for every cluster, we could drop these features and rerun the clustering.

Find the density of each cluster: calculate the average distance between each point and its closest centroid.


In [19]:
df["cluster"] = y_pred

distances = np.zeros(len(y_pred))
for i in range(k):
    center = kmeans.cluster_centers_[i]
    distances[y_pred == i] = metrics.euclidean_distances(X_std[y_pred == i]
                                                        , center.reshape(1, -1)).squeeze()
df["distance"] = distances
df.sample(10)


Out[19]:
gradyear gender age friends basketball football soccer softball volleyball swimming ... shopping clothes hollister abercrombie die death drunk drugs cluster distance
16893 2008 F 16.364 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0.046323
18432 2008 F 16.690 23 0 0 0 1 0 0 ... 1 0 0 0 0 0 1 0 4 0.305214
2962 2006 M 17.960 51 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0.043324
6076 2006 F 18.319 54 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0.109869
14262 2007 M 18.215 14 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 4 0.107005
16646 2008 M 16.304 0 0 0 0 0 0 0 ... 1 0 0 0 0 1 0 0 0 0.123572
26571 2009 F 16.085 30 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0.082439
20107 2008 F 16.528 119 1 0 0 1 0 0 ... 0 0 0 0 1 0 2 1 4 0.294885
13933 2007 NaN NaN 8 0 0 0 0 0 1 ... 1 0 0 0 0 0 0 0 1 0.233036
12905 2007 M 17.695 16 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0.087391

10 rows × 42 columns
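
Equivalently, kmeans.transform returns the distance from every point to every centroid, so the loop above can be replaced by picking out the assigned column (a sketch, not part of the original run):

all_dist = kmeans.transform(X_std)          # shape (30000, 5)
distances_alt = all_dist[np.arange(len(y_pred)), y_pred]
np.allclose(distances_alt, distances)       # expect True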


In [20]:
df.pivot_table("distance", "cluster", "gender", aggfunc="mean")


Out[20]:
gender F M
cluster
0 0.131036 0.126245
1 0.281103 0.333438
2 0.260543 0.266780
3 0.231254 0.225874
4 0.232222 0.205003
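
The task asks for the user count and mean distance per cluster regardless of gender; the densest cluster is the one with the smallest mean distance (cluster 0, judging from the gender breakdown above):

df.groupby("cluster")["distance"].agg(["count", "mean"]).sort_values("mean")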

Let's find the anomalies based on the distance of each profile from its centroid. Identify the outliers using the box-and-whisker (IQR) method.


In [21]:
def find_outliers(a):
    # Box-and-whisker rule: flag points beyond 1.5 * IQR from the quartiles
    q1, q3 = np.percentile(a, [25, 75])
    iqr = q3 - q1
    lower_whisker = max(q1 - 1.5 * iqr, np.min(a))
    upper_whisker = min(q3 + 1.5 * iqr, np.max(a))
    is_outlier = (a < lower_whisker) | (a > upper_whisker)
    return is_outlier

In [22]:
anomalies = df[find_outliers(df.distance)]
anomalies.shape, df.shape


Out[22]:
((1543, 42), (30000, 42))
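
So 1,543 of the 30,000 profiles (about 5%) are flagged as anomalies.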

In [23]:
df.distance.plot.hist(bins = 50, log = True)


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a27bde860>

Apply a dendrogram (hierarchical clustering).


In [24]:
from scipy.cluster.hierarchy import linkage, dendrogram

In [25]:
plt.figure(figsize=(15, 10))
# Complete-linkage hierarchical clustering on the scaled features
row_clusters = linkage(X_std, method="complete", metric="euclidean")
# Truncate the dendrogram to the top 5 merge levels for readability
f = dendrogram(row_clusters, p=5, truncate_mode="level")
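
Note that complete linkage needs the full pairwise-distance matrix, which is O(n²) memory for 30,000 profiles. If the cell above exhausts memory, a random subsample keeps it tractable (a sketch; the 2,000-row sample size is an arbitrary choice):

rng = np.random.RandomState(1)
idx = rng.choice(len(X_std), size=2000, replace=False)
row_clusters_small = linkage(X_std[idx], method="complete", metric="euclidean")
f = dendrogram(row_clusters_small, p=5, truncate_mode="level")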


